ABSTRACT

We apply a recently developed method for classifying broad absorption line quasars 
(BALQSOs) to the latest QSO catalogue constructed from Data Release 5 of the 
Sloan Digital Sky Survey. Our new hybrid classification scheme combines the power 
of simple metrics, supervised neural networks and visual inspection. In our view the 
resulting BALQSO catalogue is both more complete and more robust than all previous 
BALQSO catalogues, containing 3552 sources selected from a parent sample of 28,421 
QSOs in the redshift range 1.7 < z < 4.2. This equates to a raw BALQSO fraction of 
12.5%. 

In the process of constructing a robust catalogue, we shed light on the main prob- 
lems encountered when dealing with BALQSO classification, many of which arise due 
to the lack of a proper physical definition of what constitutes a BAL. This introduces 
some subjectivity in what is meant by the term BALQSO, and because of this, we 
also provide all of the meta-data used in constructing our catalogue, for every object 
in the parent QSO sample. This makes it easy to quickly isolate and explore sub- 
samples constructed with different metrics and techniques. By constructing composite 
QSO spectra from sub-samples classified according to the meta-data, we show that 
no single existing metric produces clean and robust BALQSO classifications. Rather, 
we demonstrate that a variety of complementary metrics are required at the moment 
to accomplish this task. Along the way, we confirm the finding that BALQSOs are 
redder than non-BALQSOs and that the raw BALQSO fraction displays an apparent 
trend with signal-to-noise, steadily increasing from 9% in low signal-to-noise data, up 
to 15%. 
the subclass of the so-called HiBALs, only displaying absorption
in certain high-ionisation lines (e.g. N V 1240 Å, C IV 1549 Å,
Si IV 1397 Å). However, some, known as LoBALs, also show
absorption in some low-ionisation lines (most notably
Mg II 2800 Å).

The most straightforward explanation for the differences
between QSOs and BALQSOs is a simple orientation effect.
Thus all QSOs may undergo significant mass loss through winds
(Ganguly & Brotherton 2008), but BALs are only observed if the
central continuum and/or emission line source is viewed
directly through the outflowing material. Viewed in this
context, BALQSOs may be the only available tracers of a key
physical process common to all AGN. Also, the powerful outflows
we observe in BALQSOs are an important example of AGN feedback
in action (Tremonti et al. 2007). Such feedback is a key
ingredient required in theoretical attempts to understand
galaxy "downsizing" and may also be responsible for regulating
the growth of supermassive black holes. Moreover, the fraction
of QSOs displaying



1 INTRODUCTION 



Broad absorption line quasars (BALQSOs) are a subclass of
active galactic nuclei (AGN) exhibiting strong, broad and
blue-shifted spectroscopic absorption features (Foltz et al.
1990; Weymann et al. 1991; Reichard 2003b; Hewett & Foltz
2003). These features are thought to be formed in fast
(0.1c-0.2c) and powerful outflows from the accretion disk
around the supermassive black hole at the heart of the AGN
(Korista 1992). The vast majority of BALQSOs are radio-quiet
(Stocke et al. 1992, but see Brotherton et al. 2006 for some
counterexamples), and there are subtle differences between
their continuum and emission line properties and those of
normal (non-BAL) QSOs (Reichard 2003b). However, despite
these differences, BALQSOs and non-BALQSOs appear to be drawn
from the same parent population (Reichard 2003b). Most
BALQSOs belong to
BAL features (f_BALQSO) may provide a direct estimate of
the opening angle of these outflows.

Historically, BALQSO samples have been selected on the basis
of the so-called balnicity index (BI; Weymann et al. 1991) or
similar metrics. These samples consistently yielded BALQSO
fraction estimates in the range f_BALQSO ≈ 0.10 - 0.15
(Weymann et al. 1991; Tolea et al. 2002; Hewett & Foltz 2003;
Reichard 2003a). In a previous paper (Knigge et al. 2008;
Paper I), we showed that both the BI and a more recently
defined metric, the absorption index (AI; Trump et al. 2006),
are biased when selecting BALQSOs, the former being incomplete
at the low-velocity end of the BALQSO distribution, and the
latter suffering from significant contamination by objects with
low-velocity absorption systems which may be unrelated to the
higher velocity outflows.

Here, we use a combination of the classic BI metric, a simple
neural network and visual inspection (the hybrid-LVQ approach
we developed in Paper I) to produce a BALQSO sample that is
both more complete than purely BI-based ones and, importantly,
significantly more robust than AI-based ones. We have applied
our hybrid-LVQ algorithm to the QSO sample associated with
Data Release 5 (DR5) of the Sloan Digital Sky Survey
(SDSS; Adelman-McCarthy et al. 2007;



- 1700 Å with 1 Å dispersion. It also uses the associated BIs
for training the neural network and to flag borderline cases
requiring visual inspection. The BI metric is defined as

BI = \int_{-25000}^{-3000} \left[ 1 - \frac{f(v)}{0.9} \right] C \, dv    (1)



Schneider et al. 2007) using the BIs calculated by
Gibson et al. (2009). The resulting catalogue contains 3552
BALQSOs selected from a parent sample of 28,421 QSOs on the
basis of absorption close to the C IV high-ionisation emission
line. This catalogue may be obtained from
http://www.astro.soton.ac.uk/~simo. A preliminary version of
the catalogue has already been presented in Scaringi et al.
(2008). In addition, we also provide (at the same address) a
catalogue of the meta-data, i.e. the data pertaining to the
parent QSO sample and subsequently used in the compilation of
our BALQSO catalogue, so that members of the scientific
community wishing to compile their own BAL/non-BAL subsamples
may readily do so.



2 DATA AND METHODS 

2.1 The QSO parent population 

The SDSS DR5 QSO catalogue contains over 77,000 objects in
total (Schneider et al. 2007). However, for the purpose of
constructing a uniform BALQSO catalogue, we only consider
objects whose spectra fully cover the C IV 1550 Å resonance
line and its associated absorption region (up to 29,000
km s^-1 blueward of the C IV line centre), which displays a
particularly deep and well-defined absorption trough in the
spectra of most BALQSOs. Given the wavelength range covered by
the SDSS spectra, this implies an effective redshift window of
1.7 < z < 4.2. This window yields a QSO parent sample of
28,421 objects, which is the sample used throughout this study
to compile our BALQSO catalogue.



2.2 Metrics and Preconditioning 

Our BALQSO classification method works on continuum-normalised
spectra covering the wavelength range 1401 Å
Here, the limits of the integral are in units of km s^-1 and
f(v) is the normalised flux as a function of velocity
displacement from line centre. The constant C = 0 everywhere,
unless the normalised flux has satisfied f(v) < 0.9
continuously for at least 2000 km s^-1, at which point it is
switched to C = 1 until f(v) > 0.9 again. Based on this
definition, objects are classified as BALQSOs if their
BI > 0 km s^-1. The BI by definition excludes strong,
low-velocity absorption systems; for example, any deep
absorption of width 3000 km s^-1 which starts less than
2000 km s^-1 blueward of the rest wavelength of the C IV
emission line will be assigned BI = 0 km s^-1. Thus BALQSO
catalogues constructed using the BI metric are likely to be
significantly incomplete at the low-velocity end of the
distribution.
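
The BI definition described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used to build the catalogue. It assumes a uniformly sampled grid of blueshift velocities (taken as positive here), the 3000 - 25,000 km s^-1 integration window implied by the later comparison with the AI, and hypothetical names (`balnicity_index`, `v`, `f`). The toy spectra are invented.

```python
def balnicity_index(v, f, v_min=3000.0, v_max=25000.0,
                    min_width=2000.0, depth=0.9):
    """Sketch of the BI for a continuum-normalised spectrum.

    v : blueshift velocities in km/s (positive, ascending, uniform grid)
    f : normalised flux at each velocity
    """
    bi = 0.0
    run_start = None  # velocity at which the current f < depth run began
    for i in range(len(v) - 1):
        if v[i] < v_min or v[i] > v_max:
            run_start = None
            continue
        if f[i] < depth:
            if run_start is None:
                run_start = v[i]
            # C = 1 only after min_width of continuous absorption
            if v[i] - run_start >= min_width:
                bi += (1.0 - f[i] / depth) * (v[i + 1] - v[i])
        else:
            run_start = None  # flux recovered, so C resets to 0
    return bi


# Toy spectra on a 100 km/s grid (invented for illustration)
v = [100.0 * i for i in range(291)]
low_velocity = [0.5 if 1500.0 <= x < 4500.0 else 1.0 for x in v]
high_velocity = [0.5 if 5000.0 <= x < 10000.0 else 1.0 for x in v]
print(balnicity_index(v, low_velocity), balnicity_index(v, high_velocity))
```

Under these assumptions the deep, 3000 km s^-1 wide trough that starts 1500 km s^-1 blueward of line centre receives BI = 0, while the same trough placed at higher velocity receives a positive BI.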

For this reason Hall et al. (2002) introduced the so-called
AI (Absorption Index), in an attempt to recover those objects
with low-velocity absorption systems that were missed by
the BI. The AI is defined as



AI = \int_{-29000}^{0} \left[ 1 - f(v) \right] C \, dv    (2)



Here, C = 1 in all regions where f(v) < 0.9 continuously for
at least 1000 km s^-1 and C = 0 otherwise. The two key
differences that allow objects with BI = 0 km s^-1 to achieve
AI > 0 km s^-1 are (i) that the AI includes regions within
3000 km s^-1 of line centre (and also regions beyond
25,000 km s^-1) and (ii) that the AI includes objects with
much narrower absorption troughs than the BI. The remaining
differences are associated with the presence [absence] of the
factor 0.9 in Equations (1) and (2). The less stringent
constraints imposed by the AI allow one to recover the majority
of the low-velocity absorption systems missed by the BI, more
than doubling the number of objects classed as BALQSOs.
However, as shown in Paper I, the log-AI distribution is
bi-modal, with low-velocity outflows preferentially occupying
one mode and high-velocity outflows occupying the other. While
it is beyond doubt that at least some of the BALQSOs classified
solely by the AI are bona-fide BALQSOs in the traditional
sense, particularly in the region where the two modes overlap,
it remains uncertain whether the two modes are physically
connected. Thus we cannot exclude the possibility that the AI
includes substantial numbers of objects whose low-velocity
absorption systems are unrelated to the high-velocity flows
traditionally associated with the BALQSO phenomenon. Specific
examples of hard-to-classify BALQSO spectra selected using
either the AI or BI may be found in Paper I.
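
The AI described above can be sketched in the same spirit. The snippet is only an illustration under stated assumptions: a uniform grid of positive blueshift velocities, a 0 - 29,000 km s^-1 window, a trough counted in full once it stays below 0.9 for at least 1000 km s^-1, and hypothetical names (`absorption_index`, `v`, `f`). The toy spectra are invented.

```python
def absorption_index(v, f, v_max=29000.0, min_width=1000.0, depth=0.9):
    """Sketch of the AI for a continuum-normalised spectrum.

    v : blueshift velocities in km/s (positive, ascending, uniform grid)
    f : normalised flux at each velocity
    """
    dv = v[1] - v[0]
    ai = 0.0
    run = []  # fluxes of the current contiguous f < depth region
    for vi, fi in zip(v, f):
        if 0.0 <= vi <= v_max and fi < depth:
            run.append(fi)
            continue
        # The region has ended: count it only if it is wide enough (C = 1)
        if len(run) * dv >= min_width:
            ai += sum(1.0 - x for x in run) * dv
        run = []
    if len(run) * dv >= min_width:  # region touching the end of the grid
        ai += sum(1.0 - x for x in run) * dv
    return ai


# Toy spectra on a 100 km/s grid (invented for illustration)
v = [100.0 * i for i in range(291)]
narrow = [0.5 if 1500.0 <= x < 3000.0 else 1.0 for x in v]
too_narrow = [0.5 if 1500.0 <= x < 2000.0 else 1.0 for x in v]
print(absorption_index(v, narrow), absorption_index(v, too_narrow))
```

The first toy trough lies entirely within 3000 km s^-1 of line centre, so it would contribute nothing to the BI, yet it yields a positive AI. This is the kind of low-velocity system that the AI recovers and the BI excludes.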

The classification problems described above are illustrated
in Fig. 1. The figure displays QSO geometric mean composites
created using the DR3 subset from our DR5 parent population,
normalised at 1750 Å^1. More specifically, it shows the
average properties of QSO spectra on a grid in



1 We have used the DR3 subset so that we can use the AIs
provided by Trump et al. (2006).
AI/BI space, allowing a close examination of the absorption
trough dependence on the AI and the BI. For reference we have
also included in each panel the same non-BALQSO composite
created from QSOs with AI = 0 km s^-1 (dashed green curve).
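
A geometric mean composite of the kind used here can be sketched as follows. The snippet assumes that all spectra have already been shifted to the rest frame and resampled onto a common wavelength grid with strictly positive fluxes. The function name and the toy numbers are hypothetical.

```python
import math


def geometric_mean_composite(wave, spectra, norm_wave=1750.0):
    """Geometric mean of spectra after normalising each at norm_wave.

    wave    : common rest-frame wavelength grid (Angstrom)
    spectra : list of flux lists on that grid (fluxes must be > 0)
    """
    # Index of the grid point closest to the normalisation wavelength
    k = min(range(len(wave)), key=lambda i: abs(wave[i] - norm_wave))
    composite = []
    for j in range(len(wave)):
        logs = [math.log(s[j] / s[k]) for s in spectra]
        composite.append(math.exp(sum(logs) / len(logs)))
    return composite


# Two toy "spectra" on a three-point grid (invented for illustration)
wave = [1700.0, 1750.0, 1800.0]
spectra = [[2.0, 4.0, 8.0], [1.0, 1.0, 1.0]]
print(geometric_mean_composite(wave, spectra))
```

A geometric mean preserves the average power-law slope of the input spectra, which is one reason it is a natural choice when comparing the reddening of different sub-samples.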

It is generally clear from Fig. 1 that both the AI and the BI
tend to select redder QSOs, and that the troughs become not
only wider but also deeper with increasing AI/BI. Moreover,
Fig. 1 shows how QSO samples selected from the low-velocity
region of the AI do not display the "traditional" BALQSO
properties. This is best shown in the second composite from
the left panel (top) with BI = 0 km s^-1 and
1 km s^-1 < AI < 500 km s^-1, which shows little, if any, sign
of absorption when compared to the non-BAL composite. The next
panel down displays a composite created from 1561 QSOs (the
largest sub-sample) which have 500 km s^-1 < AI < 5000 km s^-1
and BI = 0 km s^-1. Relative to the non-BAL composite, there
is some evidence of absorption close to the C IV emission
line. However, we caution that, since the BAL composite is
significantly redder than the non-BAL composite, identifying
broad absorption lines in this spectrum is difficult without
first dereddening the spectrum. The remaining panels show
spectra with increasingly prominent absorption in the vicinity
of C IV, with the absorption strength (depth) and reddening
increasing both with increasing AI (moving from top to
bottom), and with increasing BI (left to right).

We conclude from examining Fig. 1 that using AI > 0 km s^-1
to select BALQSOs is unreliable, since QSOs with
0 km s^-1 < AI < 5000 km s^-1 and BI = 0 km s^-1 have spectra
that are not different from those of AI = 0 km s^-1
non-BALQSOs. Moreover, most BALQSOs fall in the low-velocity
region of both the AI and the BI continuum, which also turns
out to be the region that is hardest to classify. However, it
is interesting to note that QSOs with AI > 5000 km s^-1 and
BI < 500 km s^-1 do look like BALQSOs. We have decided to omit
the AI metric from our hybrid classification method since
about 50% of the objects selected using this metric may not be
genuine BALQSOs (see Paper I). Instead, our hybrid-LVQ method
uses the BI, a simple neural network and visual inspection to
select BALQSOs.

2.3 Hybrid-LVQ selection of BALQSOs 

The method we use to classify BALQSOs has already been
described in detail in Paper I, so we only provide an overview
of the key points here. Briefly, our method is a hybrid of
BI-based, neural network and visual classifications, and is
designed to produce a more complete BALQSO sample than a pure
BI selection, but without significantly increasing the number
of false positives. Starting with a BI-based classification
(as calculated by Gibson et al. 2009), we use a simple neural
network-based machine learning algorithm called "learning
vector quantization" (LVQ; Kohonen 2001) to identify objects
that might have been misclassified by the BI. All such objects
are then inspected and classified visually.

We caution that the measured BI (and the AI, for reference)
is very sensitive to our ability to perform an accurate fit to
the underlying continuum. Overestimating the underlying
continuum strength can yield a large positive AI and BI in the
absence of any absorption. Conversely, if the continuum is
underestimated, weak broad absorption features may go
unrecognised. This is an issue which can also affect our
hybrid-LVQ classification method. For this reason we have
decided to use the BIs calculated by Gibson et al. (2009),
since their continuum fitting algorithm is likely to be
superior to the one we use for normalising spectra for input
into LVQ. This is mainly because they employ multiple
composites in order to fit the underlying continuum
(Trump et al. 2006).

For input into LVQ, we normalise all QSO spectra using the
method described in Knigge et al. (2008) and North et al.
(2006), in which each spectrum is fitted with a modified DR3
QSO composite (constructed from objects with AI = 0 km s^-1 as
calculated by Trump et al. 2006), allowing for
object-to-object differences in reddening and spectral index.
We then bin each spectrum onto a uniform grid in wavelength,
and use the binned spectrum between 1401 Å and 1700 Å for our
classification purposes.

The way we train our LVQ network to recognise BALQSOs has
been described in detail in Paper I. In brief, we employ a
training set composed of 400 BI > 0 km s^-1 and 400
BI = 0 km s^-1 QSOs and train our LVQ network to recognise
BI > 0 km s^-1 objects at first. We then visually inspect our
neuron map for BALQSO misclassifications (locating
BI > 0 km s^-1 QSOs in BI = 0 km s^-1 nodes and vice versa)
and re-tag those objects. We then retrain our LVQ network
using the new BALQSO vs. non-BALQSO tags to create a final
neuron map. Note that redshift uncertainties are explicitly
taken into account by our network and all the spectra have
been de-reddened to match the non-BALQSO composite. Below, we
will sometimes refer to the full hybrid method as LVQ-based,
but it is always worth keeping in mind that LVQ is only one
part of a process also involving the BI and visual inspection.
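
For readers unfamiliar with LVQ, the basic idea can be sketched with the simplest variant of the algorithm (LVQ1). This is a generic textbook-style illustration, not the network described in Paper I: the number of prototypes, the learning-rate schedule, the function names and the four-bin toy "spectra" are all assumptions made for the example.

```python
import random


def nearest(prototypes, x):
    """Index of the prototype closest to x (squared Euclidean distance)."""
    return min(range(len(prototypes)),
               key=lambda k: sum((p - xi) ** 2
                                 for p, xi in zip(prototypes[k], x)))


def train_lvq1(X, y, per_class=1, rate=0.1, epochs=20, seed=0):
    """Train LVQ1 prototypes on vectors X with integer labels y."""
    rng = random.Random(seed)
    prototypes, labels = [], []
    # Initialise the prototypes from randomly chosen members of each class
    for c in sorted(set(y)):
        members = [x for x, t in zip(X, y) if t == c]
        for x in rng.sample(members, per_class):
            prototypes.append(list(x))
            labels.append(c)
    order = list(range(len(X)))
    for epoch in range(epochs):
        step = rate * (1.0 - epoch / epochs)  # linearly decaying rate
        rng.shuffle(order)
        for i in order:
            k = nearest(prototypes, X[i])
            # Attract the winner if its label matches, repel it otherwise
            sign = 1.0 if labels[k] == y[i] else -1.0
            prototypes[k] = [p + sign * step * (xi - p)
                             for p, xi in zip(prototypes[k], X[i])]
    return prototypes, labels


def classify(prototypes, labels, x):
    """Label of the prototype nearest to x."""
    return labels[nearest(prototypes, x)]


# Toy four-bin "spectra": flat (label 0) or with a trough (label 1)
non_bal = [[1.0 + 0.01 * i] * 4 for i in range(5)]
bal = [[1.0, 0.4 + 0.01 * i, 0.4 + 0.01 * i, 1.0] for i in range(5)]
prototypes, labels = train_lvq1(non_bal + bal, [0] * 5 + [1] * 5)
print(classify(prototypes, labels, [1.0, 0.45, 0.5, 1.0]))
```

In the hybrid scheme, the initial labels would come from the BI tags and would then be revised after inspecting the trained map, as described in the text.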

2.4 The final BALQSO catalogue 

Our LVQ-based DR5 BALQSO catalogue contains 3552 objects
(~12.5% of the parent QSO sample). Fig. 2 shows a flow diagram
detailing the individual steps involved in creating this
catalogue, along with the numbers of QSOs associated with each
step. Overall, we find that 3205 QSOs (11.3% of the parent
sample) are classified as BALQSOs by the BI metric (i.e.
BI > 0 km s^-1), and 3282 QSOs (11.5%) are classified as
BALQSOs by the LVQ network alone (without visual inspection).
The subset of objects classified as BALQSOs by both methods
comprises 2130 QSOs (7.5%), and only these are added to the
final catalogue without undergoing visual inspection. All of
the QSOs for which the BI and LVQ classifications disagree are
inspected and classified visually. This step contributes a
further 1422 objects (5.0%) to the catalogue.
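
The decision rule and the bookkeeping behind these numbers can be written down compactly. The sketch below uses only counts quoted in the text. The function name `hybrid_class` and the `visual` argument are hypothetical, and the number of visually inspected objects is derived here from the quoted counts.

```python
def hybrid_class(bi, lvq_tag, visual=None):
    """Final BAL tag from the BI, the LVQ tag and (if needed) inspection."""
    bi_tag = bi > 0
    if bi_tag == bool(lvq_tag):
        return bi_tag  # automated verdicts agree: no inspection needed
    if visual is None:
        raise ValueError("BI and LVQ disagree: visual inspection required")
    return bool(visual)


# Counts quoted in the text
n_parent, n_bi, n_lvq, n_both, n_visual = 28421, 3205, 3282, 2130, 1422

n_final = n_both + n_visual                         # catalogue size
n_inspected = (n_bi - n_both) + (n_lvq - n_both)    # BI and LVQ disagree
print(n_final, n_inspected, round(100.0 * n_final / n_parent, 1))
```

Of the objects on which the two automated methods disagree, only those confirmed by eye enter the catalogue, which is why the visual step adds fewer objects than it inspects.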

Figure 1. Composites in various AI-BI ranges (blue line), and composites created from AI = 0 km s^-1 and BI = 0 km s^-1 objects
(dashed green line). The AI and BI bins on the side panels are in km s^-1. Reddening has not been taken into account.

Figure 2. Flow diagram illustrating the steps involved in our
hybrid-LVQ classification method.

That BALQSO classification can be difficult is highlighted
when one considers the percentage of false identifications
produced by each of the two automated methods (i.e. the BI and
LVQ) in isolation. The 3552 BALQSOs in our final catalogue
include 2840 of the 3205 objects classified as BALQSOs by the
BI metric calculated by Gibson et al. (2009). Thus the BI
alone would have missed 20.0% (712/3552) of the objects in our
final catalogue and produced false positives at a rate of
11.4% (365/3205). Similarly, LVQ alone would have missed 20.0%
(710/3552) of our BALQSOs and produced false positives at a
rate of 13.4% (440/3282). Both methods individually yield very
comparable false identification rates, which highlights the
large uncertainties associated with previous BALQSO
classifications. Clearly, designing a fully automated,
reliable and reasonably complete classification scheme for
BALQSOs is a difficult task.

In order to explore this issue further, we present Fig. 3,
which shows four QSO spectra that highlight some of the
subjectivity and difficulty associated with classifying
BALQSOs. The two spectra on the left side both have positive
BIs and, as a consequence, are included in the Gibson et al.
(2009) BALQSO catalogue, but not in ours. These were objects
which were tagged as non-BALQSOs by our LVQ network and were
visually inspected for final classification, since they both
had positive BIs. The spectrum in the bottom left
(SDSS J010858.02+005114.6) provides a particularly useful
insight. Here, the C IV line shows no sign of absorption, and
thus this object was not classified as a BALQSO by us.
However, there is some evidence that the Lyman alpha line does
show reasonably broad, blue-shifted absorption. By contrast,
the spectra on the right have BI = 0, despite the fact that
they show signs of absorption (and are therefore included in
our catalogue). These objects were recognised by the LVQ
network and, due to the disagreement between the BI and LVQ
verdicts, visually inspected for final classification.



One last note of caution concerns the rates of false
positives and negatives among objects that were not visually
inspected. While the classifications of the 2227 objects that
were inspected visually may be considered to be fairly
reliable, the samples of non-inspected objects are not as
clean. In particular, since the BI and LVQ alone produce false
positives at rates of 11.4% and 13.4%, respectively, we may
expect 1.5% (0.114 × 0.134) of the 2130 BALQSOs on which they
both agree to be false positives. This amounts to roughly 33
expected false positives in our BALQSO catalogue. Conversely,
both methods miss approximately 20% of BALQSOs, so they will
erroneously agree on a non-BAL classification for 4%
(0.2 × 0.2) of true BALQSOs. This amounts to roughly 85 false
negatives, i.e. 85 BALQSOs that are missing from our
catalogue.
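
The arithmetic behind these estimates can be reproduced from the counts quoted in the text. Two assumptions are made explicit in this sketch: the errors of the BI and LVQ are treated as statistically independent (as the product of rates implies), and the quoted figure of about 85 false negatives is recovered when the 4% is applied to the 2130 objects on which both methods agree. The variable and function names are hypothetical.

```python
# Counts quoted in the text
n_final, n_both = 3552, 2130
n_bi, n_bi_confirmed = 3205, 2840
n_lvq, n_lvq_rejected = 3282, 440


def rates(n_selected, n_confirmed, n_catalogue):
    """Missed fraction and false-positive rate of a single method."""
    missed = (n_catalogue - n_confirmed) / n_catalogue
    false_positive = (n_selected - n_confirmed) / n_selected
    return missed, false_positive


bi_missed, bi_fp = rates(n_bi, n_bi_confirmed, n_final)
lvq_missed, lvq_fp = rates(n_lvq, n_lvq - n_lvq_rejected, n_final)

# Joint error rates under the independence assumption
expected_fp = bi_fp * lvq_fp * n_both
expected_fn = bi_missed * lvq_missed * n_both  # base of 2130 is assumed

print(round(100 * bi_missed, 1), round(100 * bi_fp, 1))
print(round(100 * lvq_missed, 1), round(100 * lvq_fp, 1))
print(round(expected_fp), round(expected_fn))
```

If the errors of the two methods are correlated, for example because both struggle with the same low signal-to-noise spectra, the true numbers of shared false positives and negatives would be larger than these estimates.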

Because of the many problems encountered when trying to
compile BALQSO catalogues, we have decided to produce for the
scientific community a meta-catalogue which includes our whole
DR5 parent sample used in this work instead of just a BALQSO
catalogue^2. The first ten entries of this meta-catalogue are
presented in Table 1. For each QSO in our parent sample we
provide all tags that we have found to be useful in creating
our own hybrid-LVQ BALQSO catalogue.



3 DISCUSSIONS 

3.1 The classification of borderline cases 

In this section we highlight the difficulties in compiling
BALQSO samples using composites derived from our QSO
meta-catalogue, as shown in Fig. 4. Each panel displays the
non-BALQSO composite (shown in dashed green) normalised to
1750 Å and reddened to match the other composites shown in
each panel (solid blue lines), which were created by selecting
relevant QSO sub-sets culled from the meta-catalogue. In the
top row, we show composites from QSOs in our parent sample
which were finally classified as BALQSOs by our hybrid-LVQ
method, subdivided into objects with AI = 0 km s^-1 (top
left), BI = 0 km s^-1 (top middle) and an LVQ non-BAL tag (top
right). All of these composites show clear signatures of
absorption blueward of C IV. We note that, for consistency, we
have only used objects already included in SDSS DR3 in
constructing the composites shown in the left panels, since
only these have AI values calculated by Trump et al. (2006).

The composite in the upper left panel comprises objects with
AI = 0 km s^-1 (and therefore also BI = 0 km s^-1) that were
classified as BALQSOs by the LVQ network and subsequently
confirmed as BALQSOs visually. Although only 126 objects were
used for the creation of this composite, the absorption near
C IV and the slightly truncated emission line are clear
BALQSO signatures. These are mostly BALQSOs whose troughs are
narrower than 1000 km s^-1 (and hence have AI = 0 km s^-1).

The BALQSO composite in the top middle panel was created
from objects with BI = 0 km s^-1, but subsequently



2 An electronic version can be found at 
Figure 3. Four QSO spectra with different classification tags from the Gibson et al. (2009) catalogue and ours. The two spectra on the
left have positive BIs but are not included in our BALQSO catalogue, whilst the two spectra on the right have BI = 0 km s^-1 and are
included in our catalogue.



Table 1. First 10 objects from our DR5 meta-catalogue. The column names are the same as those used by the SDSS team with the
exception of the last 2. LVQ_tag is set to 1 if the neural network regarded the QSO as a BALQSO, 0 if not. Final_tag is set to 1 if
the QSO is considered as a BALQSO by our hybrid-LVQ method. The BIs have been taken from Gibson et al. (2009). The ts_t_qso
and ts_t_hiz columns represent the Low-z Quasar selection flag and the High-z Quasar selection flag respectively, as defined by the SDSS team
(Schneider et al. 2007). The catalogue can be found in electronic format from http://www.astro.soton.ac.uk/~simo and the VizieR
Figure 4. Various composites created in order to examine our BALQSO classification parameter space. The solid blue line displays
composites created by selecting different QSOs in our classification parameter space. The dashed green line is a composite created with
AI = 0 km s^-1 QSOs after being normalised at 1750 Å and de-reddened to match the composites in each panel.
identified as BALQSOs by the LVQ neural network. This
composite shows the largest amount of reddening and a fairly
narrow and deep absorption blueward of C IV. Such narrow
features will, by definition, be classified as non-BALs using
the BI metric since they lie within 3000 km s^-1 of the line
centre. In the upper right panel we show the composite formed
from QSOs with BI > 0 km s^-1, identified as non-BALs by the
LVQ network, but finally included in the catalogue on the
basis of visual inspection. This composite shows a strong,
broad absorption line blueward of C IV, but very little
reddening. Furthermore, since the composites in both the top
right and top middle panels contain similar numbers of QSOs,
it is evident that neither method alone is as reliable in
identifying BALQSOs as one might hope. All of the information
needed to recreate these composites can be found in our
meta-catalogue by querying on various tags.
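
As an illustration of such a query, the sketch below selects the sub-samples behind four of the panels from a list of catalogue rows. `LVQ_tag` and `Final_tag` are the tag names given in the caption of Table 1. The `BI` and `name` keys, the `select` helper and the five rows are hypothetical and invented for the example.

```python
def select(rows, bi_positive, lvq_tag, final_tag):
    """Rows matching a BI sign, an LVQ tag and a final hybrid tag."""
    return [r for r in rows
            if (r["BI"] > 0) == bi_positive
            and r["LVQ_tag"] == lvq_tag
            and r["Final_tag"] == final_tag]


# Invented rows standing in for meta-catalogue entries
rows = [
    {"name": "A", "BI": 0.0, "LVQ_tag": 1, "Final_tag": 1},
    {"name": "B", "BI": 350.0, "LVQ_tag": 0, "Final_tag": 1},
    {"name": "C", "BI": 120.0, "LVQ_tag": 0, "Final_tag": 0},
    {"name": "D", "BI": 0.0, "LVQ_tag": 1, "Final_tag": 0},
    {"name": "E", "BI": 900.0, "LVQ_tag": 1, "Final_tag": 1},
]

top_middle = select(rows, False, 1, 1)     # BI = 0, LVQ BAL, confirmed
top_right = select(rows, True, 0, 1)       # BI > 0, LVQ non-BAL, confirmed
bottom_middle = select(rows, True, 0, 0)   # BI > 0, LVQ non-BAL, rejected
bottom_right = select(rows, False, 1, 0)   # BI = 0, LVQ BAL, rejected
print([r["name"] for r in top_middle + top_right])
```

The same pattern extends to any other combination of tags, so alternative BALQSO samples can be built without rerunning the classification.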

The bottom row of Fig. 4 compares those composites comprising
QSOs classified as BALQSO candidates using only a single
metric that were then re-classified as non-BALs after visual
inspection. In detail, the composite at bottom left is created
from objects with AI > 0 km s^-1, but finally classified as
non-BALs by our hybrid-LVQ method. There is virtually no
evidence for absorption in this composite spectrum, despite
all of the 2362 objects comprising this composite having
AI > 0 km s^-1. This again points out the problematic nature
of the AI metric for BALQSO selection purposes.

The middle bottom panel displays a composite created from
objects having BI > 0 km s^-1 (and therefore also
AI > 0 km s^-1) and a non-BALQSO LVQ tag, finally classified
by us as non-BALQSOs. This composite looks very similar to the
one on the top right, which was already discussed above. Here
again, we see a fairly strong, smooth and broad
(> 3000 km s^-1) absorption feature associated with C IV. The
similarity between these two composites is not entirely
unexpected, since all of the objects forming them had the same
automated classifications (positive BI and a non-BAL LVQ tag),
and thus differed only in the outcome of the visual inspection
step. Since disagreement between BI and LVQ is most likely to
happen for difficult borderline cases, we should certainly
expect some misclassifications and thus overlap between the
two sub-sets of QSOs represented by these composites. However,
closer inspection does reveal some significant differences
between the composites that point to the subtle, but
consistent absorption line properties that were evidently
picked up by the visual classification step. For example, the
peaks of the C IV and Lyman-α lines are lower in the top right
composite than in the bottom middle one, and only the top
right one shows clear evidence of absorption eating into the
blue wing of the C IV line (compared to the non-BAL
composite). Moreover, even though both composites show some
evidence for absorption affecting the bluest part of the
spectrum - shortward of Lyman alpha, and particularly around
the Lyman beta and O VI blend near 1030 Å - this absorption is
stronger in the top right panel. Finally, the broad absorption
trough associated with C IV in the bottom middle panel is
suspiciously symmetric between 2000 km s^-1 and
20,000 km s^-1, the limits within which the BI is calculated.
This may indicate that this trough is formed from the
superposition of many narrow lines that may or may not be
associated with the traditional BAL-flow.

All of these differences are consistent with the idea that
the objects represented in the top right panel (which are
included in our final BALQSO catalogue) are more likely to be
genuine BALQSOs than those represented in the bottom middle
panel (which are not included in our final catalogue).
However, there is no escaping the fact that the differences
are extremely subtle and that a definitive classification
scheme for such borderline cases remains elusive. This
conclusion is supported by a visual re-inspection of all of
the objects contained in these two sub-samples: while we
generally remain happy with our classifications as "best-bet
estimates", it is clear that in many cases a definitive
classification is impossible. Since the 365 borderline cases
represent 11% of the Gibson et al. (2009) sample, we caution
that there is a systematic uncertainty of ~11% on the BALQSO
fraction suggested by even the best presently available
classification schemes. This is one of the key reasons we have
decided to provide the community with all of the meta-data we
have used in constructing our own catalogue.

The last panel, at bottom right, displays a composite created
by selecting QSOs originally classified as BALQSOs by our
neural network but re-classified as non-BALQSOs during the
visual inspection phase (these objects all had BI = 0 km s^-1
by definition, or they would not have been inspected
visually). The composite here is somewhat redder than the
non-BAL composite (E(B - V) = 0.04), but no clear signatures
of absorption are present. This highlights the importance of a
visual inspection phase when constructing BALQSO catalogues.

To summarise, it is clear that no single metric (or vi- 
sual intervention) is adequate in deriving both complete and 
clean samples of BALQSOs at the moment, so a variety 
of complementary metrics should instead be employed. Our 
own experience with unsupervised and supervised learning 
networks shows that, even though much of the classification 
work may indeed be automated, human intervention is not 
only useful, it is often a necessity when dealing with classi- 
fication involving not so clearly defined training samples. 

3.2 The effect of S/N 

Fig. 5 shows f_BALQSO as a function of signal-to-noise for
BI-selected QSOs, LVQ-selected QSOs and our final BALQSO
fraction using our hybrid-LVQ method. The same trend as that
found by Gibson et al. (2009) is evident for BI-selected
BALQSOs: f_BALQSO steadily increases from ~9% in low
signal-to-noise data up to 15% in high signal-to-noise data.
We suspect that this is because in low signal-to-noise data
even relatively small random fluctuations in a shallow BAL
trough can trigger the zero reset in the BI calculation and
can thus result in BI = 0 km s^-1. We note that this would not
necessarily be the case if BALs were identified using a more
sophisticated metric than the BI to isolate the BAL. We cannot
rule out, however, that the apparent trend in the BAL fraction
with S/N has a more interesting cause, such as an underlying
trend with redshift or luminosity (i.e. the BAL fraction may
be higher among low-redshift and/or high-luminosity QSOs,
which would also have higher S/N spectra, on average).
However, the simpler and more mundane explanation - that the
trend is primarily due to the difficulty in identifying BAL
features in low-S/N spectra - seems far more likely. We have
also visually inspected some of the objects with high BI that
are not included in our final catalogue and conclude that
these are cases where the
Figure 5. f_BALQSO as a function of signal-to-noise calculated
in the same way as Gibson et al. (2009) for BI-selected QSOs,
hybrid-LVQ selected QSOs and the final BALQSOs included in
our final hybrid-LVQ catalogue.



BI must have been calculated incorrectly and should have
been set to BI = 0 km s^-1.

By contrast, the BALQSO fraction produced by LVQ alone at
high S/N levels is roughly constant, and slightly lower than
the fraction suggested by the BI or indicated by our final
catalogue. Thus the maximum efficiency of LVQ (when working on
high-quality spectra) is comparable to, but slightly lower
than, that of the BI. Fig. 5 also shows that the LVQ-suggested
BALQSO fraction actually increases towards the lowest S/N
levels. Given that the number of low-S/N BALQSOs suggested by
LVQ alone is actually higher than that in our final catalogue,
and that every LVQ-selected BALQSO candidate was either
included in the catalogue or rejected as a false positive via
visual inspection, this implies that LVQ has a tendency to
classify low-S/N spectra as BALQSOs, leading to a higher false
positive rate in this limit. This is not entirely unexpected
and actually means that LVQ and BI selections are highly
complementary methods when applied across the full range of
S/N levels.
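
A curve of this kind is simply the fraction of tagged objects in bins of signal-to-noise. The sketch below shows one way to compute it. The function name, the bin edges and the toy values are hypothetical and are not taken from the catalogue.

```python
def fraction_by_snr(snr, is_bal, edges):
    """BAL fraction in each [lo, hi) bin of S/N (None for empty bins)."""
    fractions = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        flags = [b for s, b in zip(snr, is_bal) if lo <= s < hi]
        fractions.append(sum(flags) / len(flags) if flags else None)
    return fractions


# Invented values for illustration only
snr = [2, 3, 4, 12, 15, 18, 25]
is_bal = [0, 0, 1, 0, 1, 0, 1]
print(fraction_by_snr(snr, is_bal, [0, 10, 20, 30, 40]))
```

Running the same function on the BI tag, the LVQ tag and the final tag of each object gives the three curves that are compared in the text.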



catalogue, we have explored in detail the classification
parameter space for BALQSOs and highlighted the difficulties
in BALQSO classification using single metrics. We have also
constructed - and make available - a meta-catalogue that
contains all of the information needed to recreate our BALQSO
catalogue from its much larger QSO parent sample, or to create
alternative BALQSO samples using different selection criteria.
In addition, all of the composite spectra shown in this paper
will be made publicly available. Meta-catalogues provide an
elegant solution to problems encountered regarding
subjectivity and transparency (Hogg & Lang 2008), in
particular when dealing with "ill-defined" astronomical
objects such as BALQSOs.
4 CONCLUSIONS

We have used a recently developed technique for identifying
broad absorption lines in quasar spectra to compile a more
robust and complete BALQSO catalogue. Our technique is based
on a combination of the traditional "balnicity index", a
simple neural network and visual inspection of borderline
cases, and is designed to produce BALQSO samples that are more
complete than purely BI-based ones, while still avoiding a
high incidence of false positives. Our final catalogue covers
the redshift range 1.7 < z < 4.2 and contains 3552 BALQSOs,
corresponding to a raw fraction of ~12.5% of the SDSS DR5 QSO
parent sample with a false positive rate of ~11%. In the
process of constructing a robust BALQSO